StartR Workshop
University of Konstanz
November 24, 2024
Photo courtesy of @lg17
Merging is the process of combining two (or more) data sets into one. Merging requires that the data sets have at least one variable in common, usually an ID variable.
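There are four types of merging:
- inner join: keeps only observations that appear in both data sets
- left join: keeps all observations from the first data set
- right join: keeps all observations from the second data set
- full join: keeps all observations from both data sets
We use functions from the dplyr package.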
We create two data frames with the shared key variable id. They have observations from different participants and different variables.
Now we can use functions from the dplyr package to merge the data frames.
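As a minimal sketch of this step, the example below builds two small data frames that share the key variable id and merges them with each of the dplyr join functions. All values, and the variables anx and dep in this particular layout, are made up for illustration and are not the workshop's original data.

```r
library(dplyr)

# Two illustrative data frames sharing the key variable id (values are made up)
dfA <- data.frame(id = c(1, 2, 3), anx = c(10, 8, 15))
dfB <- data.frame(id = c(2, 3, 4), dep = c(9, 12, 11))

# Keep only ids that occur in both data frames (ids 2 and 3)
inner <- inner_join(dfA, dfB, by = "id")

# Keep all ids from dfA (ids 1, 2, 3); dep is NA for id 1
left <- left_join(dfA, dfB, by = "id")

# Keep all ids from dfB (ids 2, 3, 4); anx is NA for id 4
right <- right_join(dfA, dfB, by = "id")

# Keep all ids from either data frame (ids 1, 2, 3, 4)
full <- full_join(dfA, dfB, by = "id")
```

Comparing nrow() across the four results is a quick way to check which observations each join keeps.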
different names of the key variable
multiple key variables
dfA <- data.frame(id = c(1, 1, 2, 2), wave = c(1, 2, 1, 2),
                  anx = c(10, 8, 15, 16), dep = c(7, 9, 12, 11))
dfB <- data.frame(id = c(1, 1, 3, 3), wave = c(1, 2, 1, 2),
                  ang = c(2, 4, 11, 11), dis = c(5, 5, 3, 5))
dplyr::full_join(dfA, dfB, by = c("id", "wave"))
  id wave anx dep ang dis
1  1    1  10   7   2   5
2  1    2   8   9   4   5
3  2    1  15  12  NA  NA
4  2    2  16  11  NA  NA
5  3    1  NA  NA  11   3
6  3    2  NA  NA  11   5
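For the other case named above, a key variable that has different names in the two data frames, the by argument can map one name to the other. The following is a minimal sketch; the column name participant and all values are hypothetical.

```r
library(dplyr)

# The key is called "id" in dfA but "participant" in dfB (hypothetical names)
dfA <- data.frame(id = c(1, 2), anx = c(10, 8))
dfB <- data.frame(participant = c(1, 2), dep = c(7, 9))

# Named vector: match column "id" in dfA to column "participant" in dfB
merged <- full_join(dfA, dfB, by = c("id" = "participant"))
```

The merged data frame keeps the key under the name used in the first data frame, here id.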
Photo courtesy of @hansreniers
Reshaping is the process of changing the layout of a data set, for example from wide to long format, without changing the values it contains.
We start with a data frame in a wide format.
Transform data frame to long format with pivot_longer().
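A minimal sketch of this transformation, assuming a small illustrative data frame (the names wide, t1, t2, time, and score are made up for this example):

```r
library(tidyr)

# Illustrative wide data frame: one row per person, one column per time point
wide <- data.frame(id = c(1, 2), t1 = c(10, 8), t2 = c(7, 9))

# Stack the columns t1 and t2: their names go into "time", their values into "score"
long <- pivot_longer(wide, cols = c(t1, t2),
                     names_to = "time", values_to = "score")
```

The result has one row per person and time point, so the two rows of the wide data frame become four rows.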
This time, we start with a data frame in a long format.
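The reverse direction, from long to wide, can be sketched with pivot_wider() from the tidyr package. The data frame and its column names (time, score) are illustrative assumptions, not the workshop's original example.

```r
library(tidyr)

# Illustrative long data frame: one row per person and time point
long <- data.frame(id = c(1, 1, 2, 2),
                   time = c("t1", "t2", "t1", "t2"),
                   score = c(10, 7, 8, 9))

# Spread the values of "score" into one column per level of "time"
wide <- pivot_wider(long, names_from = time, values_from = score)
```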
Often variable names in the wide format contain more than one piece of information. For example, the variable a_1 contains information about the variable a and the time point 1. This is called a hidden identifier.
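library(tidyr)
# Wide data frame (values match the output below)
dfr <- data.frame(id = c(1, 2), a_1 = c(10, 8), a_2 = c(7, 9),
                  b_1 = c(2, 11), b_2 = c(5, 5))
# from wide to long
dfr_long <- pivot_longer(
  dfr, cols = c(a_1, a_2, b_1, b_2),
  names_to = c("variable", "time"),
  names_sep = "_",
  values_to = "value")
dfr_long
# A tibble: 8 × 4
     id variable time  value
  <dbl> <chr>    <chr> <dbl>
1     1 a        1        10
2     1 a        2         7
3     1 b        1         2
4     1 b        2         5
5     2 a        1         8
6     2 a        2         9
7     2 b        1        11
8     2 b        2         5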
Aggregation is the process of combining multiple observations into a single observation.
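There are two types of aggregation: column-wise aggregation (across observations, within a variable) and row-wise aggregation (across variables, within an observation).
We use functions from the dplyr package.
Compute various column-wise statistics: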
# Compute mean of a single variable
summarize(dfr, MEAN = mean(a_1))
      MEAN
1 7.666667
# Compute mean and standard deviation of a single variable
summarize(dfr, MEAN = mean(a_1), SD = sd(a_1))
      MEAN       SD
1 7.666667 2.516611
# Compute mean of multiple variables
summarize(dfr, across(c(a_1, b_2), mean))
       a_1      b_2
1 7.666667 4.333333
# Compute mean and standard deviation of multiple variables
summarize(dfr, across(c(a_1, b_2), list(MEAN = ~mean(.x), SD = ~sd(.x))))
  a_1_MEAN   a_1_SD b_2_MEAN   b_2_SD
1 7.666667 2.516611 4.333333 1.154701
Compute various row-wise statistics:
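Row-wise statistics could be computed along the following lines. This is a sketch under assumptions: the choice of statistics (a row mean and a row maximum) and the new column names a_mean and b_max are illustrative.

```r
library(dplyr)

dfr <- data.frame(id = c(1, 2, 3), a_1 = c(10, 8, 5), a_2 = c(7, 9, 2),
                  b_1 = c(2, 11, 8), b_2 = c(5, 5, 3))

# Option 1: rowMeans() on the selected columns (base R, vectorized)
dfr$a_mean <- rowMeans(dfr[, c("a_1", "a_2")])

# Option 2: rowwise() plus c_across() for an arbitrary function per row
dfr <- dfr %>%
  rowwise() %>%
  mutate(b_max = max(c_across(c(b_1, b_2)))) %>%
  ungroup()
```

rowMeans() is usually the simpler choice for means and sums; rowwise() is more general but slower on large data frames.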
We can aggregate for different groups separately using group_by().
# Define a wide data frame
dfr <- data.frame(id = c(1, 2, 3, 4), condition = c("A", "A", "B", "B"),
                  a_1 = c(10, 8, 5, 7), a_2 = c(7, 9, 2, 5),
                  b_1 = c(2, 11, 8, 4), b_2 = c(5, 5, 3, 1))
dfr
  id condition a_1 a_2 b_1 b_2
1  1         A  10   7   2   5
2  2         A   8   9  11   5
3  3         B   5   2   8   3
4  4         B   7   5   4   1
Aggregate for conditions A and B:
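A grouped aggregation of this data frame might look like the sketch below. Summarizing the variable a_1 with a mean and standard deviation is an assumption made for illustration; any other column or statistic works the same way.

```r
library(dplyr)

dfr <- data.frame(id = c(1, 2, 3, 4), condition = c("A", "A", "B", "B"),
                  a_1 = c(10, 8, 5, 7), a_2 = c(7, 9, 2, 5),
                  b_1 = c(2, 11, 8, 4), b_2 = c(5, 5, 3, 1))

# group_by() defines the groups; summarize() then returns one row per group
by_condition <- dfr %>%
  group_by(condition) %>%
  summarize(MEAN = mean(a_1), SD = sd(a_1))
```

The result contains one row for condition A and one for condition B.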
A loop is used to repeat a sequence of commands multiple times, each time using a different value of a loop index.
A loop consists of
- a loop index (e.g., i) that takes on different values
- a vector (e.g., 1:3) with all values that the loop index should take
- an expression (e.g., print(i)) with the commands to be executed for each value of the loop index
These elements are combined in a loop statement:
for(index in vector){expression}
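Filled in with the example values mentioned in the description (index i, vector 1:3, expression print(i)), the loop statement could be written as:

```r
# index: i, vector: 1:3, expression: print(i)
for (i in 1:3) {
  print(i)
}
```

The body runs three times, once for each value of i.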
Let’s use the example from above, where we computed means for different columns of a data frame.
# Define a wide data frame
dfr <- data.frame(id = c(1, 2, 3), a_1 = c(10, 8, 5), a_2 = c(7, 9, 2),
                  b_1 = c(2, 11, 8), b_2 = c(5, 5, 3))
for(column in c("a_1", "a_2", "b_1", "b_2")){
  var <- dfr[, column]
  MEAN <- mean(var)
  print(MEAN)
}
[1] 7.666667
[1] 6
[1] 7
[1] 4.333333
In R, loops are often less efficient than vectorized alternatives, but beginners in particular often use them because they are intuitive.
A conditional is used to execute commands only if a certain condition is met. It typically consists of if and else statements.
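For comparison, the column means from the loop example can be obtained without an explicit loop. The sketch below uses the base R functions colMeans() and sapply(); the object names means and sds are illustrative.

```r
dfr <- data.frame(id = c(1, 2, 3), a_1 = c(10, 8, 5), a_2 = c(7, 9, 2),
                  b_1 = c(2, 11, 8), b_2 = c(5, 5, 3))

# colMeans() computes all column means in a single call
means <- colMeans(dfr[, c("a_1", "a_2", "b_1", "b_2")])

# sapply() applies an arbitrary function (here sd) to each column
sds <- sapply(dfr[, c("a_1", "a_2", "b_1", "b_2")], sd)
```

Both calls return a named vector with one element per column.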
[1] "1 is larger than 0"
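A conditional that produces output like the line above could look as follows. The variable x, the object msg, and the wording of the else branch are assumptions made for this sketch.

```r
x <- 1

# The if branch runs when the condition is TRUE, the else branch otherwise
if (x > 0) {
  msg <- paste(x, "is larger than 0")
} else {
  msg <- paste(x, "is not larger than 0")
}
print(msg)
```

Note that if() expects a single TRUE or FALSE value, not a vector of conditions.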
A frequently used special form is the ifelse() function, which checks a condition for each element of a vector and returns one value where the condition is met and another value where it is not.
[1] "smaller" "smaller" "smaller" "smaller" "smaller" "larger" "larger"
[8] "larger" "larger" "larger"
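Output of this kind could be produced by a call like the one below. The vector x and the cutoff 5 are assumptions, chosen so that the result has five "smaller" and five "larger" values.

```r
x <- 1:10

# Element-wise: "larger" where x > 5, "smaller" everywhere else
labels <- ifelse(x > 5, "larger", "smaller")
print(labels)
```

Unlike if and else, ifelse() is vectorized, so it evaluates the condition for every element of x at once.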